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, Genetic regulation is a key component in development, but a clear understanding of the 

■ structure and dynamics of genetic networks is not yet at hand. In this paper we investi- 

gate these properties within an artificial genome model originally introduced by Reil [17]. 
We analyze statistical properties of randomly generated genomes both on the sequence- 

QQ , and network level, and show that this model correctly predicts the frequency of genes in 

■ genomes as found in experimental data. Using an evolutionary algorithm based on stabi- 
' lizing selection for a phenotype, we show that dynamical robustness against single base 
' mutations, as well as against random changes in initial states of regulatory dynamics 

that mimic stochastic fluctuations in environmental conditions, can emerge in parallel. 
Point mutations at the sequence level have strongly non-linear effects on network wiring, 
1 J ■ including as well structurally neutral mutations and simultaneous rewiring of multiple 

|^N( ' connections, which occasionally lead to strong reorganization of the attractor landscape 

and metastability of evolutionary dynamics. Evolved genomes exhibit characteristic pat- 
terns on both sequence and network level. 

Keywords: artificial genome, gene regulatory network, evolution 



1. Introduction 

The transcription of DNA into mRNA and subsequent translation into protein is the 
fundamental genetic process; it is the crucial first step by which genetic information 
gives rise to an organism. Development is not such a linear process, however. By 
binding to specific regions of the genome, the protein produced by one gene can 
affect the activity of other genes, and those genes may in turn express proteins that 
enhance or inhibit still more genes. A network of interactions responsible for the 
regulation of genetic activity is thus defined. Such genetic regulation is important 
if cells are to have independent control over their behavior. 

Today, the available amount of data for regulatory interactions in a number of 
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model organisms, as, for example. Yeast [21] is steadily increasing. A number of 
distinguishing structural properties have been identified, namely scale-free degree 
distributions [11], motifs [6] and modular organization [20]. 

Still, there is not enough information to suggest a comprehensive theory of 
how genetic regulatory networks attain a particular structure, how genes in such 
networks interact and respond to perturbation, and how evolution has shaped these 
factors. This study is an attempt to explore these questions in the context of one 
particular model [17], in the hopes that it has features that correspond to the limited 
data currently available, and so that some progress toward a comprehensive theory 
might eventually be made. 

Traditionally, attempts to understand the characteristics of regulatory networks 
have focused on dynamical properties. That is, a network topology is specified and 
rules are applied to describe how each gene in the network responds to inputs. Some 
initial state is then assigned and the time evolution of gene activity is studied. A 
variety of rules have been used, including Boolean switches [12], thresholds [14,18], 
and differential equations [9]. Much less work has been done in understanding how 
the machinery of transcription, translation, and binding might act throughout the 
genome to produce the topology of a genetic network. In fact, most studies of 
genetic networks ignore modeling DNA-specific processes altogether [5]. The first 
part of our study examines to what extent Reil's model [17], which includes explicit 
parameterizations for transcription and translation, can produce realistic genetic 
networks based on random genome realizations. 

A description of the method we will use for building genetic regulatory networks 
follows, along with comparisons to published and publically available experimental 
data. Statistical properties of random realizations of artificial genomes are derived, 
and related to network structure. Next, we investigate the dynamics of our modeled 
networks when applying threshold dynamics to gene behavior. Although this is a 
strong simplification, this type of discrete dynamics has been successfully applied in 
a number of studies that are concerned with the co-evolution of network dynamics 
and -structure [2-4]. Finally, we are interested in understanding the role evolution 
might play in selecting particular network topologies. This is explored by asking 
how genome structure changes when those networks with certain dynamical prop- 
erties are preferentially selected. Similar questions have been addressed in a small 
number of previous proof-of-principle studies using artificial genomes [1,10,13], how- 
ever, without relating the observed adaptation to changes in sequence and network 
topology. In particular, we investigate a scenario of stabilizing selection similar to 
previous studies concerned with the evolution of developmental canalization [4], and 
evolution of gene regulatory networks in changing environments [] . 

We find evolution towards robustness of regulatory dynamics against both noise, 
modeled as fluctuating initial conditions, and against mutations. We show that, in 
principle, this phenotypic robustness can be traced back to adaptive changes on the 
sequence level that lead to emergence of more robust regulatory networks. 
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Fig. 1. Schematic description of artificial genome construction (after [17]): a fixed sequence of Ip 
digits - here '0101' (red boxes), /p = 4 - is defined as promoter sequence. Wherever it occurs 
on the string defining the genome, the next Ig digits are defined as "genes" (here, Ig = 6, green 
boxes). If the gene is active, a transcription factor (TF) is produced by increasing each digit of 
the gene sequence by 1 (yellow), crudely mimicking transcription and translation. The algorithm 
searches matching sequences anywhere in the genome in binding regions (blue box in the lower 
string) between genes, defining binding sites (BS) for the TF. The so-defined directed regulatory 
interaction from gene 1 to gene 2 can either activate or inhibit transcription of the next gene 
downstream (gene 2, dashed arrow). For details, see section [2. ll in the text. 



2. Model Details 

2.1. Regulatory network construction from random sequences 

An artificial genome can be constructed as follows (also see Fig. [Ij. Randomly 
string together S integers drawn uniformly between and 3. The use of 4 digits 
need not be the case, but does provide correspondence with the ATGC alphabet of 
real genomes. For the purpose of generalization, the length of the alphabet in the 
artificial genome may in principle take any positive integer value A. Next, define a 
base promoter sequence of length Ip to indicate the position of genes in the genome, 
say '0101'. Wherever the promoter sequence occurs, the next Ig digits are specified as 
a gene. Translation of the gene sequence into a protein occurs simply. Each number 
in the sequence is incremented by 1 and any values greater than the last number 
in the base set of digits become the first number {e.g., the gene '012323' becomes 
the protein '123030'). Binding sites are determined by searching the genome for the 
protein sequence. If a match is found, then the protein is a transcription factor (TF) 
that binds to that site and that regulates the next downstream gene. In case there 
are multiple binding sites of this TF for this gene, only one of them is counted for 
network construction. TFs may enhance or inhibit gene activity. In this study each 
TF has equal contribution to a gene's state and has equal probability of activating or 
suppressing gene expression. In real genetic systems, a TF may activate some genes 
and inhibit others, depending on a complex interplay between various factors that 
do not only depend on sequence. In our study, we make the simplifying assumption 
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that a TF is either activating or inhibiting, which is determined by the sum Sg of its 
sequence: if Sg < {l/2)srnax, where Smax = ~ is the maximal possible cross 
sum value, it is inhibiting, otherwise it is activating. Activation and inhibition are 
reflected by different weight values in the interaction matrix (network) defined by 
all TFs and their corresponding binding sites (cf. section 12. 2p . 

Clearly this model greatly simplifies the true transcription, translation, and 
binding processes. The binding of a real transcription factor to a cis-site, for ex- 
ample, depends on the protein's structure, shape, and environment, rather than a 
simple template matching approach. Moreover, there is a stochastic element to all 
these processes that is simply ignored here. 

Although it represents a strong simplification, the model does have biologi- 
cal justification [17]. The use of a base promoter sequence is reminiscent of the 
TATA box frequently found in eukaryotic organisms. Binding is modeled in a DNA- 
specific way, just as in real organisms. Additionally, the model has the potential 
for greater extendability than some models {e.g., Boolean networks) because it in- 
cludes DNA-specific transcription, translation, and binding. The impact of single 
base pair mutations on gene function and network structure can be studied with 
this model, and also the effect of sequence duplications (resulting in gene duplica- 
tion) or -deletions [16]. In this paper, we will restrict ourselves to single base pair 
mutations, and keep the genome size constant, both with respect to the number of 
bases S and the number of genes N. 



2.2. Regulatory dynamics 

Dynamics of state changes (activity or inhibition of genes) on the constructed net- 
works can be defined in various ways. In our study, we apply random threshold 
network (RTN) dynamics: An RTN consists of N randomly interconnected binary 
sites (spins) with states ai ~ ±1. For each site i, its state at time i -I- 1 is a function 
of the inputs it receives from other spins at time t: 

"^^' + '^ = 1-1, /.(^)<0 W 

with 

N 

m) = Y.''.,(y,{t) + h. (2) 

The N network sites are updated synchronously. In the following discussion the 
threshold parameter h is set to zero. The interaction weights Cy take discrete values 
Cij = -|-1 (activation) or —1 (inhibition); whether a given interaction is activating 
or inhibiting, is defined by the TF it is derived from, as explained in section [2. II If 
i does not receive signals from j, one has Cij — 0. 

For a finite system size TV, the dynamics of RTN, which are closely related 
to Boolean networks [12] converge to periodic attractors (limit cycles) after a finite 
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number of updates. It has been suggested that different limit cycles may correspond 
to different gene expression states (cell types) [12]. This property of RTN is also 
advantageous for defining phenotypes in artificial evolutionary scenarios that are 
subject to various kinds of selective pressure [4]. 

3. Statistical properties of the artificial genome 

In the following, N denotes the number of genes in the artificial genome, S the 
number of bases, Ig the length of gene sequences, Ip the length of promoter sequences 
(both are fixed and identical for all genes), and A the length of the alphabet. We 
now show how these quantities are interrelated via the combinatorial construction 
of the artificial genome, as outlined in section \2A\ 

3.1. Statistical distribution of Ibind 

Let us first derive the statistical distribution of lengths 4md of the binding regions 
preceding promoters in the artificial genome. We incrementally draw a sequence of 
random digits (bases) from the alphabet. Once we have drawn at least s > Ip bases, 
the probability that a promoter sequence is generated by chance with base s is 
Pp = (1/A)'p, since the last Ip digits must have position-specific values according to 
the predefined promoter sequence, and each of these values has probability 1/A to 
occur. Hence, the probability distribution of the number X of Bernoulli trials (i.e. 
the sequence length) needed to get one success (a promoter sequence) is a geometric 
distribution for s > Ip and zero otherwise. 



Since the last Ip digits constitute the promoter, the length of the preceding 
binding region is given by 4md = s — Ip, and it follows 



which is a decaying exponential distribution with a = — In (1 — A 'p). 

From Eq. |4] follows that the average length of binding regions is given by 



which is the mean of the geometric distribution. 
3.2. Genome size scaling 

From Eq. [3] follows that on average, we have to draw A'p + Ip — 1 bases to obtain 
a promoter sequence; the next Ig bases are defined as the associated gene. Hence, 




(3) 



Pihind) = Pp(l -PpY"'"-" 

= X-''"{1 - A"'")'"""' 
= X^'-" exp [-a ■ hind], 



(4) 
(5) 



{Ibtnd) = X" - I, 



(6) 
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to produce genomes with exactly N genes, the expectation value for the number of 
bases S that we have to string together is 

{S)^N -{X'- +lp-l + lg). (7) 

If we instead keep S fixed and ask for the expected number of genes, under the 
assumption that Ip < Ig <^ A'p, which holds for typical values considered in this 
study (e.g., Ig — 6, Ip = 4 and A = 4), we conclude that 

(TV) « i- . S. (8) 

3.3. Network connectivity 

In this section, we relate the previously derived statistical properties of the artificial 
genome to characteristic parameters of the resulting random networks. 

3.3.1. Average connectivity 

For a given TF, the probability to match to a random base sequence of length Ig 
is given by Pbmd = A~'9. There are n :— {Ibind) ^ + 1 subsequences of length Ig 
in a binding region of expected length {Ibind) ■ The probability that none of these 
matches the TF sequence is 

PO ^ {I - PbtndT , (9) 

thus the probability that the TF provides at least one input to the gene defined by 
the promoter sequence following a binding region is El 

P^nput - 1 - PO = 1 - (1 - A-'<')<''""''>-'« + l. (10) 

Since, in a genome with N genes, we have N binding regions and N transcription 
factors, the total number of regulatory interactions (ktotai) per genome (averaged 
over the whole ensemble of possible random genomes) scales quadratically with the 
number N of genes, 

{ktotal) = Pinput ■ N"^, (11) 

and the slope depends on A, Ig and Ip. It follows that the average connectivity 
(wiring density) {k) := {ktotal) /N scales linear with N. 

Notice, however, that the average number of regulatory interactions K obtained 
from a particular genome realization can substantially deviate from (fc), since the 
possible values of K are approximately Gaussian distributed [10]. 

''There is a finite chance that the same TF can bind more than once in a given binding region, 
however, since the update scheme for network dynamics requires uniquely defined connections, we 
assign only one regulatory input in this case. If the distance between promoters is smaller than Ig, 
no binding occurs, however, for typical parameter values of A, Ig and Ip as applied in our study, 
this is a very unlikely event and can be neglected. 
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Fig. 2. The probability of having Kout regulatory outputs (a) and the probability of having Ki„ 
regulatory inputs for random genomes with different gene lengths Ig, averaged over 10* realizations. 



3.3.2. In- and outdegree distribution 

From the above considerations, it is straight-forward to derive the statistical dis- 
tributions for the number of ingoing and outgoing hnks in randomly constructed 
genomes. As denoted in section [2?11 (see also Fig. 1), the subsequent processes of 
transcription/translation of gene sequences (incrementation of each number in the 
gene sequence by 1), defining transcription factors TF, and binding of the TF to 
subsequences of the base string by template matching, defines a network of directed 
regulatory interactions. A given TF represents an out-link of the gene that codes 
for it, and an in-link for all other genes that have binding sites for this TF. By defi- 
nition of template matching, each TF has equal probability pbind = to bind at 
any region of the base string (cf. section [3.3.ip . and hence generation of out-links is 
a Poisson process [7]. Consequently, the outdegree distribution is a Poissonian (Fig. 
2a): 

Pikout) = exp [-{K)]. (12) 

The number of inputs a gene receives from other genes, however, is proportional to 
the length Ibind of its associated binding region, hence, it follows from Eq. [5] 

P(fc„) ~exp[-/3fe,„], (13) 

i.e. the indegree distribution is exponential. Both results are confirmed by numerical 
simulations (Fig. 2a and 2b). 



3.4. Relevance to biology 

Clearly, random genome realizations ar far from being a realistic model of real 
biological genetic systems. However, it can be shown that even this extreme over- 
simplification has some relevance for biology. In Figure [31 the predicted number 
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Fig. 3. The number of genes predicted from the model as a function of genome size S 
with /p = 5 (hne). Data points (+) show the number of genes in 50 prokaryotic organ- 
isms, for which complete sequence information is available. Observed data are taken from 
I htt p : / / www . ult ranet . com/ ~j kimball/ [ 

of genes in a genome, N = (l/4)'p ■ S", is plotted as a function of genome size for 
Ip = h. Observed data from 50 prokaryotic organisms that have been completely 
sequenced are also shown. The correspondence between model and data is excel- 
lent for this range of S and shows that a combinatorial method for determining 
the number of genes in a genome is appropriate. For larger 5, as typically found 
in eukaryotic organisms, Zp = 7 is reasonable (not shown), but little observed data 
exists. On the other hand, statistical distributions of regulatory inputs and out- 
puts do not match biological data particularly well; here, more realistic statistics 
can be obtained by constructing artificial genomes from duplication and divergence 
events [16]. However, even in these models the question how selection pressure on 
the phenotype, as encoded by network dynamics, may influence genome organiza- 
tion, remains unanswered. This type of question shall be addressed in the remaining 
part of this paper. 

4. Stabilizing selection for a phenotype - an evolutionary scenario 

Though evolved by the random processes of genetic drift and selection pressure from 
changing environments, real genetic systems are far from being random. Complex or- 
ganization in genome structure is often connected to the highly non-linear nature of 
the genotype-phenotype map, which includes an intermediate layer of complex reg- 
ulatory processes controlling cell machinery (unicellular organisms) or highly struc- 
tured developmental processes (multicellular organisms) . The multilevel-structure of 
the involved evolutionary processes is sketched schematically in Fig. 4.: the genome, 
i.e. the DNA sequence, codes not only for structural proteins, but also for a complex 
gene regulatory network (GRN). The dynamics of this GRN regulates the devel- 
opment of the phenotype. The environment influences development twofold, first 
by perturbations of the developmental process (noise), second by selection pres- 
sure for viable phenotypes. Organisms reproduce by duplication of their genome, 
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which is an imperfect process frequently leading to errors (mutations). Typically, 
models of evolutionary adaptation focus either on sequence evolution or network 

structure alone, and hence imply a huge loss of information as compared to the true 
multi-level and multi-scale evolutionary dynamics. Artifical genomes could be an 



Fig. 4. Multilevel structure of the evolving gcnotype-phenotype map: besides coding for structural 
proteins, genomes also encode their own regulation by complex gene regulatory networks (GRN). 
GRN control the temporal and spatial dynamics that leads to "production" (development) of the 
organism (phenotype). The phenotype is reproduced by genome duplication, involving mutations. 
The environment can perturb developmental dynamics, as well as it selects for viable/adaptive 
phenotypes. 

important step towards models that integrate these levels, and hence may lead to 
predictions on the effects of adaptive processes on sequence- and network evolution, 
and how these are related to each other. 

4.1. Definition of the evolutionary algorithm 

In this section, we briefly explore an example of an evolutionary scenario based 
on an artificial genome, motivated by the observation that development is highly 
canalized, i.e. buffered against both intrinsic and environmental noise, and muta- 
tions [19] . A number of studies has demonstrated that stabilizing selection for par- 
ticular phenotypes leads to emergence of this high robustness, strongly facilitated 
by the high amount of neutrality contained in the fitness landscapes of complex 
regulatory networks [8]. Let us now define an evolutionary algorithm of stabiliz- 
ing selection in a strongly fluctuating environment, based on an artiflcial genome. 
We start by generating an initial population of randomly assembled genomes: the 
number of bases S is constrained such that each string contains exactly 64 genes. 
In all simulations discussed in the following, a promoter length Ip = A and a gene 
length /g = 6 are applied. Next, different limit cycles of the associated RTNs are 
identified by running network dynamics, as defined in section 2.1.1, from 10'* dif- 
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ferent random initial state configurations. This process is stopped when a RTN is 
identified which has a fixed point Sf (a Hmit cycle of length one). In addition, we 

require that there can be identified at least 4 additional attractors, motivated from 
phenotype diversity frequently observed in many species [15]. Adaptation to unpre- 
dictable environments is often handled by stochastic switching between phenotypes 
and can load to stabilization of even very small subpopulations of phenotypes that 
differ from the population majority [15], which we model by the requirement that 
the relative weight of the basin of attraction leading to Sf should be small (less 
than 40% of the tested configurations). Sf is the phenotype we want to stabilize, 
and the digit string Gf, that codes for its regulatory network, is the genotype we 
evolve. 

We now apply stabilizing selection for Sf as follows: 

(1) Create a mutant Gf by random single base mutations, occurring with a prob- 
ability Pm = 0.001 per base. 

(2) Run RTN dynamics from a random initial state, until an attractor is reached, 
otherwise stop after 200 iterations. 

(3) If dynamics has converged to Sf, keep Gf, otherwise keep Gf. 

(4) For the next generation, iterate from (1). 

We note that we disregard mutations of promoter sites, as well as mutation lead- 
ing to now "genes", to avoid complications resulting from a varying genome size. 
Notice that, in step (2), we test only one initial configuration, corresponding to the 
fact that biological organisms are tested only against the environment they face 
at the current generation. Robustness against fluctuations, i.e. the capacity to sta- 
bilize the phenotype under diverse perturbations of development by unpredictable 
environmental influences as well as internal noise, is measured by running RTN 
dynamics for Gf (Gf) for a larger set Z of initial configurations (e.g. 10^ random 
initial states). This variation in initial states simulates the fact that neither all a 
organisms in one generation meet a homogeneous environment, nor environments 
are constant over the course of generations. Then 

Rfit) := ^ (14) 

defines the robustness against fluctuations, where Zf{t) is the fraction of initial 
states that lead to Sf at generation t. A second measure of robustness is associated 
to the capacitance to buffer the system against disadvantageous mutations (muta- 
tional robustness Rm, [4]). At each generation we measure the number of accepted 
mutants Pa in the previous P generations, and define 

Rm{t) := (15) 

If Pa, and hence Rm increases with t, this indicates restructuring of the genome 
such that the probability of neutral or advantageous mutations with respect to Sf 
has increased. 
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generation generation 



Fig. 5. Time course of evolutionary dynamics. Left: Evolution of the robustness R[ against fluc- 
tuations in initial conditions, example of a particular run (thin-lined curve) and ensemble- average 
over 67 different evolutionary runs (thick-lined curve). Right: Evolution of the mutational robust- 
ness Rm, the dashed curve represents an example of an evolutionary run, the thick-lined curve the 
ensemble average. 




500 1000 1500 2000 

time (generations) 

Fig. 6. Number of different dynamical attractors, as identified by sampling dynamics from 10* 
random initial conditions in each generation, in a typical evolutionary run. Inset: statistical distri- 
bution of the number of indentified attractors, sampled from 67 evolutionary runs. The distribution 
exhibits an exponential decay (straight line shown for eye guidance). Notice that in about 2% of the 
cases, no attractors was identified due to the imposed length constraint on dynamical trajectories. 

4.2. Results 

Next, let us summarize the results obtained from evolutionary runs, starting from 
different random genome realizations with parameters as outlined in the previous 
paragraph. 

4.2.1. Evolution of robustness 

Figure [5] shows both quantities in a typical evolutionary run, and ensemble aver- 
ages obtained from 67 evolutionary runs starting from different random genome 
realizations. Both Rf and Rm increase rapidly, however, exhibiting considerable 
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Fig. 7. Robustness Rf against fluctuating dynamical initial conditions as a function of the ro- 
bustness Rm against point mutations of the genome sequence. Crosses show example results of a 
single evolutionary run, the lined curve is the ensemble average over 67 different runs. Rf and Rm 
exhibit a clear positive correlation. 

fluctuations. In particular, Rf exhibits an interesting intermittent dynamics remi- 
niscent of a punctuated equilibrium [3] , indicating metastability of the evolutionary 
dynamics. In fact, in most evolutionary runs we studied Rf and Rm could be sta- 
bilized only over a finite number of generations, as indicated in Fig. by the sharp 
decrease of both quantities around t = 1500. The metastability is also visible in the 
ensemble average oi Rf (Fig. [SJ left panel), which, after an initial sharp increase 
up to Rf « 0.4 shows a slight decline over the following generations. Another mea- 
sure that can be applied to characterize the evolution of network dynamics is the 
number of different attractors (limit cycles) that are identified by the evolutionary 
algorithm in successive generations. Ideally, when stabilizing selection always suc- 
ceeds, only the fixed point attractor corresponding to the phenotype 5/ should be 
present. Figure [6] shows that indeed most of the time the number of attractors is 
very small, however, there are intermittent increases (bursts). The inset of Fig. [6] 
shows that the statistical distribution of this quantity, as obtained from multiple 
evolutionary runs, exhibits an exponential decay. As it will be discussed later on, the 
"punctuated equilibrium" of evolutionary dynamics is both related to the selection 
criterion we chose, and to the mutation dynamics of the artificial genome, which 
is considerably different from single-link rewiring, as applied in most comparable 
"network only" studies. 

Rm is a measure of the probability that mutations are advantageous or neutral. 
In particular, neutrality of mutations strongly facilitates to find better phenotypes, 
since it allows evolution to explore a large number of different system configura- 
tions potentially leading to better phenotypes. When neutrality is a driving force 
of evolutionary dynamics, we expect that robustness against deleterious mutations 
and fitness of evolved phenotypes, i.e. Rm and Rf, are correlated. As becomes evi- 
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dent from Fig. Hi?/ and Rm are indeed positively correlated, similar to the results 
reported e.g. in [4]. 

The artificial genome now allows us to further investigate the effects of this 
evolutionary dynamics on both network and sequence structure. 




Fig. 8. Cumulative degree distributions Pc{k) of incoming regulatory links (indegree distribution) 
for generation 1, 500 and 1000, averaged over 100 evolutionary runs. Inset: Relative difference 
A := {pc{t = 1000) — pc{t = l))/pc{t = 1) between the distributions at generation 1 and 1000. 
Notice the increase around k = 18, followed by a decrease for A: > 20 (details are discussed in the 
text). 




500 1000 1500 2000 2500 -100 -50 o 50 lOO 



time (generations) #of links added/removed 

Fig. 9. Left panel: Evolution of the average network connectivity in three different runs. Right 
panel: statistical distribution of the number of links added/removed by point mutations in succes- 
sive generations of evolving networks. Statistics was taken over all accepted mutants, averaged over 
67 different evolutionary runs. About 50% of mutations are structurally neutral (no link added or 
removed), the rest shows a broad spectrum of rewiring effects at the network level. 
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Fig. 10. Cumulative number of base exchanges during evolution for different positions on the 
genome, averaged over regions containing 100 bases each, during the course of evolution from 
generation 1 to 1000 in a particular evolutionary run. The brightness in grayscale indicates the 
number of bases exchanges. Increasing ruggedness of the surface points at divergent evolution of 
genomic regions accumulating base changes with rates that differ by orders of magnitude. 




cum. # of base exchanges per region 



Fig. 11. Statistical distribution of the cumulative number of base exchanges per region at gen- 
eration 2000, averaged over 100 different evolutionary runs. Left curve (+): evolution with the 
dynamical (robustness) constraint as described in the text. The distribution has a complex mul- 
timodal structure with maxima/plateaus around 0, 40 and 80 (arrows). The lined curve on the 
right shows the same distribution for dynamically unconstrained evolution (keeping only promoter 
sequences fixed). 

4.2.2. Evolution of network- and sequence structure 

Let us first look on the evolution of network structure under the imposed dynamical 
robustness constraint. We performed 100 evolutionary runs with different initial as- 
signments of G/ and Sf] each simulation was observed over 2000 generations, and 
regulatory networks evolved after 2000 generations were compared to the initial net- 
works. With regard to average network connectivity and outdegree distributions, no 
substantial reorganization was found. However, moderate reorganization is found in 
the distribution of regulatory inputs, which is shown in Fig. [5] (for smoothing of 
data, cumulative distributions Pc{k) probability to observe a node with indegree 
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> k are shown, besides averaging over 100 different evolutionary runs). Typically, 
an increase of probability for intermediate values of k is found, while probability for 
larger k is reduced. This can be clearly appreciated by investigation of the differ- 
ence between both distributions (inset of Fig. [8]). However, the overall shape of the 
distribution does not change significantly and still stays close to an exponential, pre- 
sumably due to the small network sizes and limited number of generations observed. 
Figure [9] (left panel) shows the time evolution of the average network connectivity 
K for three different evolutionary runs. K shows considerable variance both with 
regard to different initial random genome realizations evolution starts from, as also 
with regard to fluctuations during evolution. In particular, the effects of point mu- 
tations at the sequence level on network wiring are strongly non-linear. To show 
this, we measured the statistical distribution of the number of regulatory links that 
were deleted or added in successive generations of accepted mutants, averaged over 
100 different evolutionary runs (Fig. [Ql right panel). In about 50% of the cases, 
mutations did not affect network wiring; in the remaining cases, most often only 
one or a few links were affected (see the sharp peak around zero), however, there 
are also cases were a large number of links is added or removed simultaneously. This 
result is in contrast to many other studies of network evolution, which implicitly 
assume small, stepwise local changes in network wiring and hence only small moves 
along neutral paths of the fitness landscape. In our model, this is still the most 
frequent case, however, in some instances also larger jumps between different peaks 
of the fitness landscape naturally emerge through mutations in the sequence-based 
encoding of network structure that affect a large number of regulatory interactions. 

Last, let us investigate how evolution proceeds at the most basic level of the 
system, i.e. the digit sequence of the artificial genome. Figure 10 shows the number 
of base exchanges during evolution for different positions on the genome. At each 
generation, the cumulative number of base substitutions in successive slices of 100 
digits on the genome string, identified by a unique region index, during all previous 
generations was monitored. Increasing ruggedness of the surface points at diver- 
gent evolution of genomic regions accumulating base changes with at very different 
rates, giving evidence that there co-exist highly conserved and "adaptive" regions. 
We hypothesize that the former encode the invariant "core" of the regulatory net- 
work needed to produce the phcnotype, while the latter contain neutral mutations, 
or support buffering against fluctuations. In Fig. 11, the statistical distributions for 
cumulative number of base substitutions in evolved genomes are compared to the 
control experiment without robustness constraint, only requiring preservation of 
promoter- and gene sequences. While in the control experiment, a simple, symmet- 
ric binomial distribution is found, evolved networks exhibit a strongly asymmetric, 
multi-modal distribution with at least three identifiable maxima/plateaus (indi- 
cated by arrows). This demonstrates that the selective constraint on regulatory 
dynamics indeed strongly influences the evolutionary patterns found in the genome 
at sequence level. 
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5. Discussion and Conclusions 

We studied statistical properties of the artificial genome model proposed by Reil [17] 
both on the level of base sequences and regulatory networks generated with this 
model, and the evolution of developmental canalization (robustness against noise 
in initial conditions of regulatory dynamics and against single base mutations) . We 
find that random realizations of the artificial genome exhibit pronounced differences 
between the statistical distributions of regulatory inputs and outputs, and a scaling 
in the number of genes (as a function of the number of DNA bases) compatible with 
corresponding data of prokaryotic organisms for choices of model parameters as typ- 
ically applied in our simulations. The simulation of evolutionary dynamics yields 
a number of surprising results. First, we observe that robustness is an cvolvable 
property, and in particular that robustness against deleterious mutations and ro- 
bustness against noise are correlated, similar to results of other studies [4]. However, 
while in most "network only" studies (without a sequence based description) only 
small adaptive changes (rewirings) are considered, we find emergence of highly non- 
linear effects between sequence point mutations and network wiring (as predicted 
in [22] for random genome realizations), including a large number of structurally 
neutral mutations, and mutations that lead to simultaneous rewiring of multiple 
connections. This means that, while stepwise evolution along neutral paths of the 
fitness landscape with regard to plienotypic effects of mutations [3,4,8] is still the 
main driving mechanism, also larger jumps between different peaks of the fitness 
landscape are possible. Interestingly, we found evidence that this increased non- 
linearity in the genotype-phenotype map and the resulting fitness landscape tends 
to weaken the effectiveness of stabilizing selection in the long run, and the degree 
of evolved robustness exhibits considerable fluctuations during evolutionary runs. 
Concerning network structure, we found only moderate reorganization of the sta- 
tistical distributions of input- and output numbers per node during evolution. In 
contrast, evolution leaves clearly visible signs at sequence level with a pronounced 
pattern of strongly conserved regions, and other parts of the genome evolving in a 
much less constrained way. 

To conclude, the results of our study indicate that artificial genomes represent an 
interesting step towards more realistic models for the evolution of gene regulatory 
networks (GRN), by taking into account the indirect evolution of GRN structure 
through mutation of regulatory sequences, which cannot be accounted for in "net- 
work only" models. Clearly, the results of the current study are limited in scope; in 
future extensions of the model, we will in particular address variations in the num- 
ber of genes (e.g. resulting from sequence duplications), and more realistic models 
for the binding of transcription factors to regulatory binding sites. 
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